Approaches for Source Retrieval and Text Alignment of Plagiarism Detection Notebook for PAN at CLEF 2013

Authors

  • Leilei Kong
  • Haoliang Qi
  • Cuixia Du
  • Mingxing Wang
  • Zhongyuan Han
Abstract

In this paper, we describe our approach to the PAN@CLEF2013 plagiarism detection competition. For the Source Retrieval sub-task, we propose a method that combines TF-IDF, PatTree, and Weighted TF-IDF to extract keywords from suspicious documents and use them as queries to retrieve the plagiarism source documents. For the Text Alignment sub-task, we present a method based on sentence similarity. Our text alignment algorithm and our algorithm for merging similar sentences, called the Bilateral Alternating Merging Algorithm, are described in detail.

The rapid development of the Internet makes it easier for people to search, copy, save, and reuse online sources. Copying another author's text and claiming its authorship is called plagiarism [1]. During the last decade, automated plagiarism detection in natural languages has attracted considerable attention from research and industry, taking advantage of recent developments in related fields such as information retrieval, cross-language information retrieval, natural language processing, machine learning, and artificial intelligence. PAN@CLEF is dedicated to providing an environment, consisting of a large-scale corpus of artificial plagiarism and detection quality measures, for evaluating plagiarism detection algorithms. There are two sub-tasks in PAN@CLEF2013: source retrieval and text alignment. The remaining sections of this paper introduce the methods we used in this year's competition.

The task of source retrieval is to retrieve all plagiarized sources while minimizing retrieval costs [2]. One document may plagiarize another through simple cut-and-paste manipulations, minor or wholesale alterations, and more ambiguous rewriting. One of the difficulties of efficiently detecting plagiarism source is to search the source in
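The keyword-based query formulation mentioned in the abstract can be illustrated with a minimal sketch. It covers only a plain TF-IDF ranking, not the PatTree or Weighted TF-IDF components, and it is not the authors' implementation: the function names (`tokenize`, `tfidf_keywords`, `build_query`), the smoothed IDF formula, and the use of a small list of background documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(chunk, background_docs, top_k=5):
    """Rank the words of `chunk` by TF-IDF against `background_docs`.

    Assumption: IDF is log(N / df) + 1, where N counts the chunk itself
    plus the background documents. This is one common smoothing choice,
    not necessarily the weighting used in the paper.
    """
    tf = Counter(tokenize(chunk))
    doc_sets = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_sets) + 1  # background documents plus the chunk
    scores = {}
    for word, count in tf.items():
        df = 1 + sum(word in doc for doc in doc_sets)  # the chunk contains the word
        scores[word] = count * (math.log(n_docs / df) + 1.0)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def build_query(chunk, background_docs, top_k=5):
    """Join the top-ranked keywords into a search-engine query string."""
    return " ".join(tfidf_keywords(chunk, background_docs, top_k))
```

In a source retrieval setting, each chunk of a suspicious document would be turned into one such query and submitted to a search engine; words that are frequent in the chunk but rare elsewhere rank highest, while common words fall to the bottom of the list.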

Similar resources

Guess Again and See if They Line up: Surrey's Runs at Plagiarism Detection Notebook for PAN at CLEF 2013

This paper briefly describes the approaches taken to the two subtasks of Source Retrieval and Text Alignment, in the Plagiarism Detection track at PAN 13. For the first of these, we reuse our PAN12 approach – which combines frequency and a contrastive corpus measure to select keywords for querying the ChatNoir search system; for the second, we reuse software that had previously featured in PAN1...

Diverse Queries and Feature Type Selection for Plagiarism Discovery Notebook for PAN at CLEF 2013

This paper describes approaches used for the Plagiarism Detection task in the PAN 2013 international competition on uncovering plagiarism, authorship, and social software misuse. We present a modified three-way search methodology for the Source Retrieval subtask and analyse snippet similarity performance. The results show that the presented approach is adaptable in real-world plagiarism situations. For the ...

Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches

This paper overviews the five source retrieval approaches that have been submitted to the seventh international competition on plagiarism detection at PAN 2015. We compare the performances of these five approaches to the 14 methods submitted in the two previous years (eight from PAN 2013 and six from PAN 2014). For the third year in a row, we invited software submissions instead of run submissi...

Developing Monolingual Persian Corpus for Extrinsic Plagiarism Detection Using Artificial Obfuscation: Notebook for PAN at CLEF 2015

The task of text alignment corpus construction at PAN 2015 competition consists of preparing a plagiarism corpus so that it can provide various obfuscation types and versatile obfuscation degrees. Meanwhile, its format and metadata structure should follow previous PAN plagiarism corpora. In this paper, we describe our approach for construction of a monolingual Persian plagiarism corpus that can...

PAN 2015 Shared Task on Plagiarism Detection: Evaluation of Corpora for Text Alignment: Notebook for PAN at CLEF 2015

In this paper we describe and evaluate the corpora submitted to the PAN 2015 shared task on plagiarism detection for text alignment. We received mono- and cross-language corpora in the following languages (pairs): English, Persian, Chinese, and Urdu-English, English-Persian. We present an independent section for each submitted corpus including statistics, discussion of the obfuscation techniques ...

Publication date: 2013